Andres Delgadillo

Project Supervised Learning: Classification

1 Project: Personal Loan Campaign Modeling

1.1 Objective

1.2 Data Dictionary

2 Import packages and turn off warnings

3 Import dataset and check data quality

All columns are numeric, and there are no missing values or duplicated rows
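A minimal sketch of these checks with pandas (the DataFrame name `df` and its columns here are a small hypothetical stand-in for the loan dataset):

```python
import pandas as pd

# Hypothetical sample standing in for the loan dataset
df = pd.DataFrame({
    "Age": [25, 45, 39],
    "Income": [49, 100, 11],
    "Personal_Loan": [0, 1, 0],
})

# Are all columns numeric?
all_numeric = df.dtypes.apply(pd.api.types.is_numeric_dtype).all()

# Count missing values and duplicated rows
n_missing = int(df.isnull().sum().sum())
n_duplicates = int(df.duplicated().sum())

print(all_numeric, n_missing, n_duplicates)
```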

4 Exploratory Data Analysis

4.1 Pandas profiling report

We can get a first statistical and descriptive analysis of the data using pandas_profiling

The Pandas Profiling report highlights some warnings and notable characteristics in the data:

4.2 Univariate Analysis

4.2.1 Age

4.2.2 Experience

4.2.3 Income

4.2.4 Family

4.2.5 CCAvg

4.2.6 Education

4.2.7 Mortgage

4.2.8 Personal_Loan

4.2.9 Securities_Account

4.2.10 CD_Account

4.2.11 Online

4.2.12 CreditCard

4.3 Pairplot

We are going to perform bivariate analysis to understand the relationships between the columns

4.4 Bivariate and Multivariate Analysis

4.4.1 Income and Personal Loan

4.4.2 Age and Personal Loan

4.4.3 CCAvg and Personal Loan

4.4.4 Mortgage and Personal Loan

4.4.5 Family and Personal Loan

4.4.6 Education and Personal Loan

4.4.7 Securities Account and Personal Loan

4.4.8 CD Account and Personal Loan

4.4.9 Online banking and Personal Loan

4.4.10 Credit Card and Personal Loan

5 Data Pre-Processing

5.1 Mortgage

Now we can keep the Mortgage_cat column and drop the Mortgage column
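The binning step behind Mortgage_cat can be sketched with `pd.cut`; the bin edges and labels below are assumptions for illustration, not the notebook's actual categories:

```python
import pandas as pd

# Hypothetical Mortgage values (in thousands); the bin edges and labels
# are assumptions, not the notebook's actual choices
mortgage = pd.Series([0, 0, 75, 120, 300, 635])
mortgage_cat = pd.cut(mortgage, bins=[-1, 0, 100, 200, 700],
                      labels=["None", "Low", "Medium", "High"])
print(mortgage_cat.tolist())
```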

5.2 Skewed distributions and Outliers detection

Income and CCAvg have skewed distributions that can affect model performance. We are going to analyze whether the skew is caused by outliers or whether a log transformation would improve the distributions
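A minimal way to quantify this is to compare skewness before and after a log transformation; the income values below are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed income values (in thousands)
income = pd.Series([8, 10, 12, 15, 20, 25, 40, 60, 90, 150, 200])

skew_before = income.skew()
income_log = np.log(income)   # valid here because all incomes are positive
skew_after = income_log.skew()

print(round(skew_before, 2), round(skew_after, 2))
```

A large drop in skewness suggests the long tail is a property of the distribution rather than a few isolated outliers.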

5.2.1 Age

The Age column is not skewed and there are no outliers. We will keep the column without transformation

5.2.2 Income

We will keep the log transformation for the Income column

5.2.3 CCAvg

We will keep the log transformation for the CCAvg column

5.3 Outliers treatment

For Income_log and CCAvg_log columns, values smaller than the lower whisker will be assigned the value of the lower whisker, and values above the upper whisker will be assigned the value of the upper whisker.
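The capping rule above can be sketched as follows, using the standard Tukey whiskers (Q1 − 1.5·IQR and Q3 + 1.5·IQR); the sample values are hypothetical:

```python
import pandas as pd

def clip_to_whiskers(s: pd.Series) -> pd.Series:
    """Cap values outside the Tukey whiskers (Q1 - 1.5*IQR, Q3 + 1.5*IQR)."""
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return s.clip(lower=lower, upper=upper)

# Hypothetical column with one extreme high value
s = pd.Series([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 50.0])
print(clip_to_whiskers(s).max())  # the 50.0 is capped at the upper whisker
```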

5.4 Data Preparation

5.4.1 Creating training and test data sets

Both the training set and the test set have similar class ratios for Personal_Loan
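A split with matching class ratios can be obtained with scikit-learn's `train_test_split` and its `stratify` parameter; the data below is synthetic, standing in for the loan dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and an imbalanced target (10% positives)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = np.array([1] * 100 + [0] * 900)

# stratify=y preserves the class ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

print(y_train.mean(), y_test.mean())  # both ~0.10
```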

6 Models evaluation criteria

6.1 Insights

6.1.1 The model can make two kinds of wrong predictions:

  1. Predicting that a customer has a Personal Loan when the customer actually does not (False Positive)
  2. Predicting that a customer does not have a Personal Loan when the customer actually does (False Negative)
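Both error types can be read off a confusion matrix; a minimal sketch with hypothetical labels and predictions:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions (1 = has a Personal Loan)
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0, 1, 0]

# ravel() unpacks the 2x2 matrix as (tn, fp, fn, tp)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(fp, fn)  # fp corresponds to case 1 above, fn to case 2
```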

6.1.2 Which case is more important?

6.1.3 How to reduce this loss, i.e., how to reduce False Negatives?

6.2 Functions to evaluate models

7 Logistic Regression

7.1 ROC-AUC

Now we are going to analyze the ROC-AUC in the training and test sets
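ROC-AUC is computed from predicted probabilities rather than hard labels; a minimal sketch with hypothetical scores:

```python
from sklearn.metrics import roc_auc_score

# Hypothetical true labels and predicted probabilities
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# AUC = probability that a random positive is scored above a random negative
print(roc_auc_score(y_true, y_score))  # 0.75
```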

7.1.1 Training set

7.1.2 Test set

7.2 Coefficients

Now we are going to determine the coefficients for each variable

7.2.1 Coefficients with Positive Impact

7.2.2 Coefficients with Negative Impact

The coefficients for Securities Account, Online, Credit Card, and several ZIP codes are negative. An increase in these variables decreases the odds of a customer getting a Personal Loan

7.2.3 Coefficients with No Impact

There are some ZIP codes with coefficients equal to 0. These ZIP codes have no impact on a customer getting a Personal Loan

7.2.4 Converting coefficients to odds

The coefficients of the logistic regression model are in terms of log(odds); to find the odds we take the exponential of the coefficients.
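A minimal sketch of this conversion (the coefficient names and values below are hypothetical, not the fitted model's):

```python
import numpy as np

# Hypothetical logistic regression coefficients (log-odds scale)
coefs = {"Income_log": 2.1, "CD_Account": 3.0, "CreditCard": -0.9}

# exp(coef) gives the multiplicative change in odds per unit increase:
# values > 1 increase the odds, values < 1 decrease them
odds = {name: float(np.exp(c)) for name, c in coefs.items()}
print(odds)
```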

7.2.5 Coefficient Interpretation

7.3 Model Performance Improvement

The Recall values on the train and test sets can be higher. A higher Recall implies fewer False Negatives.

7.3.1 Optimal threshold using AUC-ROC curve
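One common way to pick a threshold from the ROC curve is Youden's J statistic, which maximizes TPR − FPR; whether the notebook uses this exact criterion is an assumption, and the scores below are hypothetical:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical labels and predicted probabilities
y_true = np.array([0, 0, 0, 1, 1, 1, 0, 1])
y_score = np.array([0.2, 0.3, 0.55, 0.6, 0.8, 0.9, 0.1, 0.4])

fpr, tpr, thresholds = roc_curve(y_true, y_score)
best = np.argmax(tpr - fpr)  # index maximizing Youden's J
optimal_threshold = thresholds[best]
print(optimal_threshold)
```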

7.3.2 Precision-Recall curve
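The curve itself comes from scikit-learn's `precision_recall_curve`; a minimal sketch with hypothetical scores:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical labels and predicted probabilities
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# One (precision, recall) point per candidate threshold
precision, recall, thresholds = precision_recall_curve(y_true, y_score)
print(precision, recall, thresholds)
```

Scanning the curve for the threshold that balances precision and recall (for example by maximizing F1) is a common way to choose an operating point.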

7.3.3 Sequential Feature Selection

The original model has 476 independent variables. We are going to use Sequential Feature Selection to select the 35 most important features, reducing dimensionality and discarding uninformative features
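A sketch of forward Sequential Feature Selection with scikit-learn's `SequentialFeatureSelector`; the synthetic data below stands in for the real 476-feature matrix, and it selects 3 features instead of 35 to keep the example fast:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Small synthetic stand-in (10 features instead of 476)
X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           random_state=0)

# Greedily add the feature that most improves cross-validated score
sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000), n_features_to_select=3,
    direction="forward"
)
sfs.fit(X, y)
selected = sfs.get_support()  # boolean mask over the 10 features
print(selected.sum())
```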

Most Important Features

Retraining the model

Performance

7.4 Model Performance Summary

8 Decision Tree

8.1 Creating training and test data sets

We are going to use the original dataset for the decision tree model

8.2 Build Decision Tree Model

8.3 Visualizing the Decision Tree

8.4 Important features

8.5 Model Improvement

8.5.1 Grid Search

We are going to use grid search to find optimal hyperparameter values and reduce overfitting in the Decision Tree model
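A sketch of the search with `GridSearchCV`; the grid values and synthetic data are hypothetical, and `scoring="recall"` matches the stated goal of reducing False Negatives:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the loan dataset
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Hypothetical hyperparameter grid
param_grid = {"max_depth": [3, 5, 7], "min_samples_leaf": [1, 5, 10]}
grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    param_grid, scoring="recall", cv=5)
grid.fit(X, y)
print(grid.best_params_)
```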

8.5.2 Cost Complexity Pruning

Now we are going to improve the Decision Tree and reduce its complexity using cost complexity pruning, identifying the optimal ccp_alpha parameter

We are going to train a decision tree using the effective alphas

Now, we are going to analyze how the number of nodes and the depth of the tree decrease as alpha increases

Now, we are going to compute the Recall value on the train and test sets for each Decision Tree associated with each alpha

Now, let's find the alpha that gives the maximum Recall score on the test set and save the corresponding Decision Tree
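The steps above can be sketched end to end (synthetic data standing in for the loan dataset):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the loan dataset
X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1, stratify=y)

# Effective alphas along the cost complexity pruning path
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_tr, y_tr)

# One pruned tree per alpha; keep the one with the best test-set recall
trees = [DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_tr, y_tr)
         for a in path.ccp_alphas]
recalls = [recall_score(y_te, t.predict(X_te)) for t in trees]
best_tree = trees[recalls.index(max(recalls))]
print(max(recalls))
```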

8.6 Model Performance Summary

9 Comparison Logistic Regression and Decision Tree

Now, we are going to compare the results for all models

10 Exploratory Data Analysis on the incorrectly predicted data

10.1 Decision Tree Rules

Since the number of incorrectly predicted rows is small, we are going to use the Decision Tree rules to analyze them
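The rules of a fitted tree can be printed with scikit-learn's `export_text`; a minimal sketch on synthetic data (the feature names are placeholders, not the loan dataset's columns):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data and a small tree for readability
X, y = make_classification(n_samples=100, n_features=4, random_state=0)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Human-readable decision rules, one branch per line
rules = export_text(tree, feature_names=["f0", "f1", "f2", "f3"])
print(rules)
```

Each misclassified row can then be traced down these rules to see which split sent it to the wrong leaf.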

11 Conclusions and Advice to grow business